Abstract

We provide a novel search technique, which uses a hierarchical model
and a mutual information gain heuristic to efficiently prune the search
space when localizing faces in images. We show exponential gains in
computation over traditional sliding window approaches, while keeping
similar performance levels.
1 Introduction

In recent years, face detection algorithms have provided extremely accurate
methods to localize faces in images. Typically, these have involved the use of
a strong classifier, which estimates the presence of a face given a particular
subwindow of the image. Successful classifiers have used Boosted Cascades
[25] [22] nil EH, Neural Networks [21] [19] [ig and SVMs [20] [24] among others.

In order to localize faces, the aforementioned algorithms have relied on a
sliding window approach. The idea is to inspect the entire image by sequentially
evaluating a classifier at each and every location a face may be in. In most
face detection algorithms [22] [17] [19] [21], this involves inspecting all pixels of
the image for faces, at all possible face sizes. This exhaustive search, however,
is computationally expensive and in general not scalable to large images. For
example, for real-time face detection using modern cameras (4000 x 3000 pixels
per image), more than 100 million evaluations are required, making it impractical
on any standard computer.

To overcome this problem, previous works in object and face localization
have simply reduced the pose space by allowing only a coarse grid of possible
locations [5] [HJ [2S]. An elegant improvement to object detection was
proposed in [22], where "feature-centric" evaluations are performed, as opposed to
"window-centric" ones, allowing previous computation to be reused. Such a method,
however, relies on strong knowledge of the classifier used. More recently, a
globally optimal branch-and-bound subwindow search method for objects in images
was proposed [15] and extended to videos [28]. Here, the classifier and the
feature space used to locate the object are dependent on a single robust feature
(e.g. SIFT [18]), making it difficult to use in the context of faces.

In this paper, we propose a novel search strategy, which can be combined 
with any face classifier, in order to significantly reduce the computational cost 
involved with searching the entire space. The design principle is as follows: We 
assume that a perfect face classifier is available, i.e. one which always provides 
the correct answer. In practice however, such a classifier does not exist and an 
accurate one (as in [551 HZl HH HZj) will be used instead. Our goal is then to 
reduce the total number of classifier evaluations required to detect and locate 
faces in images, while still providing similar performance levels when compared 
with an exhaustive search. 

A strategy proposed for computational shape recognition [8] argues that
the task of visually recognizing an object can be accomplished by querying the 
image in a sequential and adaptive way. In general, this can be regarded as 
a coarse-to-fine approach to perception [T] |25J E| [7] . This "twenty questions" 
approach can be described as follows: there is a fact to be verified, e.g. "is 
there a face in the field of view" , and each query, which consists of evaluating 
a particular function of the image, is chosen to maximally reduce the expected 
uncertainty about this fact. In the context of computer vision, such approaches 
have led to two different types of search algorithms: offline and online. In 
the offline versions, the "where to look next" strategy is computed once and 
for all, anticipating all possible queries. It has led to efficient algorithms for 
symbol recognition [T], face [B] and cat detection. In the online version, the 
strategy is computed sequentially, as information is gathered. It has led to a 
road tracking algorithm [SlU]: this approach is known as Active Testing. 

In this paper, we extend the active testing framework in order to do fast face 
detection and localization. We provide a way to ask questions that are general 
and specific with regard to the face pose, and span different feature spaces. 
Similarly to the "twenty questions" game, questions such as "is the object at this 
location with this size?" are asked by means of an accurate face classifier [55] [231 
dni HZ], independently of what features are used to guide the search. We show 
here that this approach provides a coherent framework, with few parameters to 
choose or tune, which significantly reduces the number of classifier evaluations 
necessary to localize faces. Comparison of our method with state-of-the-art face
detection algorithms, and with the traditional sliding window approach, indicates
that our framework reduces, by several orders of magnitude, the number of classifier
evaluations needed while maintaining similar accuracy levels on localization and
detection tasks. Even though this paper specifically focuses on frontal faces, 
this approach can be extended to faces in general [TH [T31 [U [THl [5] , other object 
categories [5] and to most classifiers in the machine learning literature. 

The remainder of this paper is organized as follows: in Section 2 the general
framework of our method is presented along with implementation details.
Section 3 describes localization experiments, and in Section 4 we compare the
performance with state-of-the-art methods on a detection and localization task.
Concluding remarks are provided in Section 5.

Figure 1: Each node in the tree (a) corresponds to a subwindow in the image (b).
The root of the tree, $A_{1,1}$, represents the entire image space and has four children
($A_{2,1}$, $A_{2,2}$, $A_{2,3}$, $A_{2,4}$). (c) Example query: here, the face center is $l \in A_{i,j}$. The
query $X^k_{i,j}$ counts the proportion of edges in a window twice the size of $A_{i,j}$, centered
on $A_{i,j}$. $k$ indicates that we count the proportion of edges on a surface twice the size
of the subwindow, while $(i, j)$ provides the pose subset in $A$.

2 Active Testing 

The goal set forth is to detect and localize a single frontal face of unknown 
size, which may or may not be present in the image. We define the pose of a 
face, as the pixel location of the face center and a face scale. That is, we treat 
localization as placing a bounding box around a face. In Section 4 we detail
how this can be extended to searching for multiple faces.

Active Testing (AT) can be regarded as a search algorithm which uses an 
information gain heuristic in order to find regions of the search space which 
appear promising. The region which is to be observed next is determined as 
information is gathered, and thus can be viewed as an online variation of the 
"twenty questions" game. The general approach is as follows: we are looking 
for a face in an image, and are provided with a set of questions which help 
us determine where the face is located. Questions are answered with some 
uncertainty, reducing the search space and eventually leading to the face pose. 

In addition, it is also assumed that a special question regarding the exact 
face pose is available. This question is treated as an "Oracle" , always providing 
a perfect answer when queried but is computationally expensive relative to other 
questions. Querying the oracle at every location would provide the face pose 
but is expensive and inefficient as certain questions are more informative than 
others and help reduce the search space faster. Consequently, a sub-goal is to 
determine face pose with as few questions as possible. 
2.1 Model and Algorithm

Let $Y = (L, S)$ be a discrete random variable defining the face pose, where $L$ is
the location of the face center (i.e. pixel coordinates), and $S$ is the face scale,
such that $S$ can take values $\{1, \dots, M\}$ corresponding to $M$ face size intervals.
Additionally, $Y$ can take one extra value when the face is not in the image. Let

$$A = \{A_{i,j},\ i = 1, \dots, D,\ j = 1, \dots, 4^{i-1}\}$$

be a quadtree of finite size, which decomposes the image space; $i$ indexes
the level in the tree and $j$ designates the cell at that level (see Figure 1(a)).
Every leaf is associated with a pixel in the image and each non-terminal node
corresponds to a unique subwindow in the image, representing a subset of poses
(Figure 1(b)). When no face is present in the image then $Y \in \bar{A}_{1,1}$, where $\bar{A}_{1,1}$
denotes the complement of $A_{1,1}$.
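
As a minimal sketch of this decomposition, the Python below enumerates the children of a node and maps a node to its subwindow. The function names and the child-ordering convention (the children of cell $j$ at level $i$ are cells $4(j-1)+1, \dots, 4j$ at level $i+1$, with a fixed quadrant layout) are assumptions made for illustration; the text only fixes the number of cells per level.

```python
def cells_at_level(i):
    """Number of cells at level i (the root is level 1): 4^(i-1)."""
    return 4 ** (i - 1)


def children(i, j):
    """Children of node (i, j) under the assumed ordering convention."""
    first = 4 * (j - 1) + 1
    return [(i + 1, first + c) for c in range(4)]


def window(i, j, width, height):
    """Subwindow (x0, y0, x1, y1) of node (i, j), with half-open bounds.

    The path from the root is read off the base-4 digits of j - 1;
    each digit picks a quadrant (0: top-left, 1: top-right,
    2: bottom-left, 3: bottom-right). This layout is an assumption.
    """
    x0, y0, x1, y1 = 0, 0, width, height
    digits = []
    q = j - 1
    for _ in range(i - 1):
        digits.append(q % 4)
        q //= 4
    for d in reversed(digits):
        xm, ym = (x0 + x1) // 2, (y0 + y1) // 2
        x0, x1 = (xm, x1) if d % 2 else (x0, xm)
        y0, y1 = (ym, y1) if d // 2 else (y0, ym)
    return x0, y0, x1, y1
```

With this convention, `children(1, 1)` returns the four level-2 cells shown in Figure 1, and the four child windows tile the root window exactly.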

We are interested in refining the estimate of where the face is located
iteratively and hence denote $\pi_t$ as the probability density of $Y$ at iteration step
$t$. Let $u_{i,j,s} = P(L \in A_{i,j}, S = s)$, $A_{i,j} \in A$, $s \in \{1, \dots, M\}$. By
construction, calculating $u_{i,j,s}$ can be achieved by summing the probability of $A_{i,j}$'s
children. Clearly, $u_{1,1,s} = u_{2,1,s} + u_{2,2,s} + u_{2,3,s} + u_{2,4,s}$ and similarly for any
other $u_{i,j,s}$. For any node, we also denote $u_{i,j} = \pi(A_{i,j}) = \sum_{s=1}^{M} u_{i,j,s}$. Let
$X = \{X^1, \dots, X^K\}$ be a set of question families, such that for each family $k$,
$X^k = \{X^k_{i,j},\ i = 1, \dots, D,\ j = 1, \dots, 4^{i-1}\}$, where $X^k_{i,j}$ is a query from family $k$
about the pose subset $A_{i,j}$.

The generic AT algorithm (Algorithm 1) proceeds as follows:
to begin, $\pi_0$ and the first query are initialized (lines 1 to 2). Three operations
are then repeated: the response is observed (line 4); the belief about the location
of $Y$ is updated using the latest observation (line 5); a new query is chosen for
the next iteration (line 6). The iteration stops when a terminating criterion
is met (line 7). Each line is explained in detail in the following sections.



Algorithm 1 Active Testing (AT)
1: Initialize: $i \leftarrow 1$, $j \leftarrow 1$, $k \leftarrow 1$, $t \leftarrow 0$
2: Initialize: $\pi_0(A_{1,1}) = \pi_0(\bar{A}_{1,1}) = 1/2$
3: repeat
4:    Compute the test $x = X^k_{i,j}$
5:    Compute $\pi_{t+1}$ using $\pi_t$ and $x$
6:    Choose the next subwindow and test:
         $\{i, j, k\} = \arg\max_{i', j', k'} I(Y; X^{k'}_{i', j'})$
7: until $H(\pi_{t+1}) > 1 - \epsilon$ and/or $t \geq \gamma$.
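
The control flow of Algorithm 1 can be sketched on a toy problem. The example below is a deliberately simplified, noiseless 1-D analogue: the hidden "pose" is an index, each query asks whether the index lies in an interval, and the next query is the one whose probability mass is closest to 1/2 (which maximizes the information gained from a noiseless binary answer). All names, the interval queries and the stopping rule are assumptions for illustration, not the authors' implementation.

```python
def active_testing_toy(hidden, n, max_steps=64):
    """Locate `hidden` in range(n) by adaptive interval queries.

    n is ideally a power of two so the intervals form a binary tree.
    """
    belief = {p: 1.0 / n for p in range(n)}          # uniform prior
    # Candidate queries: nested intervals [lo, hi), coarse to fine.
    queries, size = [], n
    while size >= 1:
        queries += [(lo, lo + size) for lo in range(0, n, size)]
        size //= 2
    query, steps = queries[0], 0
    while steps < max_steps:
        lo, hi = query
        x = lo <= hidden < hi                        # observe the response
        # Bayes update with a noiseless likelihood (1 or 0).
        for p in belief:
            if (lo <= p < hi) != x:
                belief[p] = 0.0
        z = sum(belief.values())
        belief = {p: w / z for p, w in belief.items()}
        steps += 1
        if max(belief.values()) > 1.0 - 1e-9:        # assumed stopping rule
            break

        # Choose the interval whose mass is closest to 1/2.
        def mass(q):
            return sum(w for p, w in belief.items() if q[0] <= p < q[1])

        query = min(queries, key=lambda q: abs(mass(q) - 0.5))
    return max(belief, key=belief.get), steps
```

For `n = 16`, every hidden index is found in at most five queries (one uninformative root query followed by four halvings), compared with up to 16 for an exhaustive scan. The real algorithm differs in that responses are noisy densities and the selection score is the mutual information of Section 2.4.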
2.2 Queries

The AT algorithm requires a set of query families, $X = \{X^1, \dots, X^K\}$, to be
specified. Each query family, $X^k$, consists of evaluating a specific type of image
functional indexed by $k$. Members of a family, $X^k = \{X^k_{i,j},\ i = 1, \dots, D,\ j =
1, \dots, 4^{i-1}\}$, are indexed by a pose index in $A$ (as in [7]). That is, $X^k_{i,j}$ is an
image functional, where $k$ defines a particular computation and $(i, j)$ specifies
the pose subset. Note that these queries are generic and need not be binary.
Example queries can be seen in Figure 1(c).

In addition, perfect tests - which precisely predict the presence of a face 
by using a classifier - are included in X. When this test is used at a specific 
pose, either the classifier responds positively and the face is deemed found, or 
conversely, the response is negative and the face is assumed not to be at this 
pose. That is, we assume no uncertainty with regard to the response of this 
classifier. 

In order to specify the joint distribution between the face pose $Y$ and queries
$X$, we make the following heuristic assumptions:

Conditional Independence

$$P\big(\{X^k_{i,j} = x^k_{i,j}\},\ i = 1 \dots D,\ j = 1 \dots 4^{i-1},\ k = 1 \dots K \mid Y = (l, s)\big) = \prod_{i,j,k} P\big(X^k_{i,j} = x^k_{i,j} \mid Y = (l, s)\big) \quad (1)$$

Homogeneity

$$P\big(X^k_{i,j} = x \mid Y = (l, s)\big) = \begin{cases} f^k_s(x, i) & \text{if } l \in A_{i,j} \\ f^k_0(x, i) & \text{if } l \notin A_{i,j} \end{cases} \quad (2)$$

Here $f^k_s(\cdot, i)$ characterizes the "response" to the query $X^k_{i,j}$ when the center of the
face is within $A_{i,j}$ with size $s$. Similarly, $f^k_0(\cdot, i)$ is the "response" when the center
is not in $A_{i,j}$. Additionally, even though $KN$ queries are specified, where $N$ is
the number of nodes in $A$, the number of densities needed is only $KD$. That is,
for each test family, only one density per level of $A$ needs to be specified. This
is why $f^k_s(\cdot, i)$ is only indexed by $i$.

Note that these assumptions are a simple way to make the problem tractable:
for example, the assumption that queries are conditionally independent given
the location of the object $Y$ is clearly a simplification, as the same pixel values are
used to compute many queries at different levels of $A$. Similarly, the actual
responses to tests might in fact depend on the precise location of the face within
$A_{i,j}$. The homogeneity assumption simplifies the response model by assuming
a single model for all cases. Even under these assumptions, however, the
experiments conducted here (Sections 3 and 4) indicate that these simplifications
provide a good way to solve the problem at hand. In addition, this model should
be taken into account when choosing which queries to use: as with a Naive Bayes
model, queries should be individually informative.
2.3 Belief Update

Once an observation has been made, the new distribution of the face
location $Y$ must be calculated (line 5 of AT). At initialization (line 2 of AT),
$\pi_0(A_{1,1}) = \pi_0(\bar{A}_{1,1}) = 1/2$, indicating that a face is believed to be in the image
with probability 1/2. Note that the probability $\pi_0(A_{1,1})$ is uniformly distributed
within $A_{1,1}$ by construction. Given $\pi_t$ and the query response $X^k_{i,j} = x$ at time
step $t$, the updated distribution $\pi_{t+1}$ can then be calculated by using Bayes'
formula

$$\pi_{t+1}(l, s) = \frac{P\big(X^k_{i,j} = x \mid Y = (l, s)\big)\, \pi_t(l, s)}{\sum_{s'} \int_{l'} P\big(X^k_{i,j} = x \mid Y = (l', s')\big)\, \pi_t(l', s')\, dl'} \quad (3)$$

Using assumptions (1) and (2), then

$$P\big(X^k_{i,j} = x \mid Y = (l, s)\big) = f^k_s(x, i)\, \mathbb{1}_{A_{i,j}}(l) + f^k_0(x, i)\, \mathbb{1}_{\bar{A}_{i,j}}(l) \quad (4)$$

Let us now define the likelihood ratio as

$$r(x, s) = \frac{f^k_s(x, i)}{f^k_0(x, i)}, \qquad s = 1 \dots M \quad (5)$$

then equation (3) can be written as

$$\pi_{t+1}(l, s) = \frac{1}{Z(x)} \Big(\mathbb{1}_{\bar{A}_{i,j}}(l) + \mathbb{1}_{A_{i,j}}(l)\, r(x, s)\Big)\, \pi_t(l, s) \quad (6)$$

where $Z(x)$ is the normalizing constant,

$$Z(x) = \pi_t(\bar{A}_{i,j}) + \sum_{s=1}^{M} r(x, s)\, \pi_t(A_{i,j}, s) \quad (7)$$

with $\pi_t(A_{i,j}, s) = u_{i,j,s}$ the mass on $A_{i,j}$ at scale $s$.

Note that the evolution from $\pi_t$ to $\pi_{t+1}$ only relies on $r(x, s)$ and allows for
probability mass to be shifted onto or away from $A_{i,j}$, depending on the response $x$.
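
A minimal sketch of the update in equations (5) to (7), assuming the belief is stored as a dictionary over discrete poses $(l, s)$ plus a hypothetical `NO_FACE` key for the extra "no face" value of $Y$. The function and variable names are illustrative, and the likelihood ratios $r(x, s)$ are passed in as plain numbers rather than computed from fitted densities.

```python
NO_FACE = "no_face"  # hypothetical key for the extra value of Y


def update_belief(belief, in_node, ratios):
    """One Bayes update of pi_t after querying node A_{i,j}.

    belief  : dict mapping (l, s) or NO_FACE to probability pi_t
    in_node : set of locations l that lie inside A_{i,j}
    ratios  : dict s -> r(x, s) = f_s(x) / f_0(x) for the observed x
    """
    unnormalized = {}
    for pose, p in belief.items():
        if pose != NO_FACE and pose[0] in in_node:
            unnormalized[pose] = ratios[pose[1]] * p   # inside A_{i,j}
        else:
            unnormalized[pose] = p                     # complement of A_{i,j}
    z = sum(unnormalized.values())                     # Z(x), equation (7)
    return {pose: w / z for pose, w in unnormalized.items()}
```

A ratio above 1 moves mass onto the queried node and a ratio below 1 moves it away, while poses outside the node (including the "no face" value) are only rescaled by $1/Z(x)$.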

In order to reduce the number of nodes to update, only a subtree is
maintained, where only nodes which have probability greater than some threshold $\tau$
are included. By construction of $A$, parent nodes have probability equal to the
sum of their children, hence any node which has probability larger than $\tau$ also
has a parent with probability greater than $\tau$. This guarantees that applying this
threshold forms a subtree within $A$ containing $A_{1,1}$. This approximation of $\pi_t$
allows for a compact representation of the distribution.



2.4 Query Selection 

We choose to select the next query by maximizing the mutual information gain
between $Y$ and the possible queries $X^k_{i,j}$ (line 6 of AT). This can be written as

$$I(Y; X^k_{i,j}) = H(X^k_{i,j}) - H(X^k_{i,j} \mid Y) \quad (8)$$

where

$$H(X^k_{i,j}) = h\Big(\sum_{s=0}^{M} u_{i,j,s}\, f^k_s(\cdot, i)\Big) \quad (9)$$

Here, $h(f)$ is the differential Shannon entropy of the density $f$. We simplify
this expression by substituting $h(f)$ with the Gini index [IT. The mutual
information then becomes

$$I(Y; X^k_{i,j}) = \sum_{s=0}^{M} \sum_{m>s} u_{i,j,s}\, u_{i,j,m} \int \big(f^k_s - f^k_m\big)^2 \quad (10)$$

where $u_{i,j,0} = 1 - u_{i,j}$. Note that the term $\int (f^k_s - f^k_m)^2$ is the squared Euclidean ($L^2$)
distance between the densities $f^k_s$ and $f^k_m$, and only needs to be computed once and then
stored for fast evaluation.
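
The identity behind equation (10) can be checked numerically. The sketch below assumes the Gini index $G(f) = 1 - \int f^2$ and uses discrete distributions (lists of probabilities) in place of densities, so integrals become sums; the function names are illustrative.

```python
def gini(p):
    """Gini index of a discrete distribution: 1 - sum of p_x^2."""
    return 1.0 - sum(v * v for v in p)


def gain_direct(u, f):
    """G(mixture) - sum_s u_s G(f_s), i.e. H(X) - H(X|Y) with Gini."""
    n = len(f[0])
    mix = [sum(u[s] * f[s][x] for s in range(len(u))) for x in range(n)]
    return gini(mix) - sum(u[s] * gini(f[s]) for s in range(len(u)))


def gain_pairwise(u, f):
    """Pairwise form of equation (10): sum_{s<m} u_s u_m ||f_s - f_m||^2."""
    total = 0.0
    for s in range(len(u)):
        for m in range(s + 1, len(u)):
            d2 = sum((a - b) ** 2 for a, b in zip(f[s], f[m]))
            total += u[s] * u[m] * d2
    return total
```

The two functions agree whenever the weights `u` sum to one. The practical advantage of the pairwise form is that the distances `d2` depend only on the response models and can be tabulated offline, leaving only the products of weights to be evaluated at each iteration.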

Since we are interested in choosing both the region $A_{i,j} \in A$ and a query
family $k$ which maximizes the information gain, one can simply evaluate $I(Y; X^k_{i,j})$
for all possible values of the triple $(i, j, k)$ and select the parameters providing
the largest gain. However, as described in Section 2.3, only a small subset of
poses is ever considered at any iteration. For example, nodes which have little
probability will surely only provide a small information gain. Consequently, we
only need to evaluate equation (10) for the explicitly maintained subtree (Figure
1(a)). Additionally, once a query has been chosen, it is removed from the set of
possible queries, further reducing the amount of computation.



2.5 Terminating Criteria 

At line 7 of the AT algorithm, two terminating criteria are presented: (i) the
algorithm runs until the entropy of $\pi$, $H(\pi)$, is very high, (ii) the algorithm
iterates for a fixed number of steps, $\gamma$. In the first case, running until the
entropy is high corresponds to two possible outcomes: either a face has been
found and most of the probability mass is at a single leaf of $A$, or most of the
mass is outside the image, in $\bar{A}_{1,1}$, and no face is believed to be present in the
image. In general, the choice of which criterion to use ((i), (ii) or both) is for the
user to decide. Sections 3 and 4 show the behavior of these scenarios.

In addition, for all cases, the total number of queries is bounded by the size 
of the tree and the number of query families. As the algorithm iterates and 
the classifier is queried, the number of poses with strictly positive probability 
decreases. This provides a guarantee that, in the worst case, the face will be 
found after having observed all the poses. 



2.6 Implementation 

We now provide some implementation details and give a more in-depth algorithm
for updating $\pi$ (see Algorithm 2) and choosing queries.

Figure 2: Sequence of queries posed by the Active Testing algorithm on a test image
from the Caltech Frontal Face Dataset. In each image, a test $X^k_{i,j}$ is computed: white
boxes show the pose, $A_{i,j}$, queried while black boxes show the subimage queried. The
number indicated in the top left of each image is the iteration number of the AT
algorithm. In image 3123, the Boosted Cascade is evaluated and a face is found at a
given scale (green box).

Before the AT algorithm begins, all features necessary to evaluate queries
from $X$ for a given image are computed and stored in the form of an integral
image, making the evaluation of a query an $O(1)$ operation (similar to [H]). This
is particularly efficient since queries $X^k_{i,j}$ compute nested subwindows.
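
The constant-time evaluation can be sketched with a plain summed-area table. The code below assumes a binary edge map stored as a list of rows; the function names are illustrative, and the oriented-edge variants used in the paper would simply keep one such table per direction.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def window_sum(ii, x0, y0, x1, y1):
    """Sum of img over the half-open window [x0, x1) x [y0, y1)."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]


def edge_proportion(ii, x0, y0, x1, y1):
    """Proportion of edge pixels in the window (binary edge map assumed)."""
    area = (x1 - x0) * (y1 - y0)
    return window_sum(ii, x0, y0, x1, y1) / area
```

Once the table is built in a single pass over the image, each window query costs four look-ups regardless of the window size, which is what makes coarse queries near the root of the tree as cheap as fine ones.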

In order to form and maintain the subtree of $A$ (line 7 of Algorithm 2), only nodes which
are above a threshold ($\tau = 0.001$) are explicitly stored. To do this, we construct
$A$ as a quadtree, and maintain a frontier set $\mathcal{F}$. $\mathcal{F}$ consists of any node $A_{i,j}$
with $u_{i,j} > \tau$ and with all children having $u_{i+1,j'} < \tau$. Applying this rule at
each iteration ensures that the maintained subtree is relevant to where the face
is believed to be located. Additionally, since the probability associated with any
node in the tree is equal to the sum of its children, we only need to update
nodes in $\mathcal{F}$, and recurse through the tree to update the remaining nodes in $A$.

After having computed the query $X^k_{i,j}$, updating any node $A_{i',j'} \in \mathcal{F}$ is
simple: if $A_{i',j'} \subset A_{i,j}$ then $u_{i',j'} = r(x)\, u_{i',j'} / Z$, otherwise $u_{i',j'} = u_{i',j'} / Z$. Doing
so updates $\pi$ as described in equation (6) in an efficient way. In addition, at any
point in the updating of $\pi$, the next best query, $S$, seen so far is maintained. The
denominator $Z$ is calculated once and for all, and used to calculate equation (10)
when each node is visited. Only the best score is kept, and ultimately chosen
for the following iteration of the AT algorithm. That is, we compute equations
(6) and (10) one after the other, requiring only one pass through the subtree
per iteration.
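
The recursive mass update can be sketched as follows. This is a simplified illustration under stated assumptions: a hypothetical `Node` class, a single scale, stored nodes without children playing the role of the frontier set, and every frontier node lying either fully inside or fully outside the queried window. Frontier maintenance (splitting and pruning against $\tau$) and the query scoring step are omitted.

```python
class Node:
    """Hypothetical tree node holding the probability mass u_{i,j}."""

    def __init__(self, u, children=()):
        self.u = u
        self.children = list(children)


def update(node, inside, r, z):
    """Recursive mass update in the spirit of Algorithm 2.

    inside(node) -> True if the node lies within the queried A_{i,j}
    r            : likelihood ratio r(x) for the observed response
    z            : normalizing constant Z, computed beforehand
    """
    if not node.children:                           # frontier node
        node.u = (r * node.u if inside(node) else node.u) / z
    else:
        for child in node.children:
            update(child, inside, r, z)
        node.u = sum(c.u for c in node.children)    # parent = sum of children
    return node.u
```

For example, with four frontier cells of mass 0.125 each under the root (the remaining 0.5 being the "no face" mass), querying the first cell with $r(x) = 3$ gives $Z = 0.875 + 3 \times 0.125 = 1.25$, and one call on the root rescales every cell and restores the parent sums in a single pass.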



Algorithm 2 Update($A_{i',j'}$, $A_{i,j}$, $x$, $S$, $\mathcal{F}$)
1: if $A_{i',j'} \in \mathcal{F}$ then
2:    if $A_{i',j'} \subset A_{i,j}$ then
3:       $u_{i',j'} \leftarrow r(x)\, u_{i',j'} / Z$
4:    else
5:       $u_{i',j'} \leftarrow u_{i',j'} / Z$
6:    end if
7:    Maintain $\mathcal{F}$
8: else
9:    for each child, $A_{i'+1,j''}$, of $A_{i',j'}$ do
10:      Update($A_{i'+1,j''}$, $A_{i,j}$, $x$, $S$, $\mathcal{F}$)
11:   end for
12:   $u_{i',j'} \leftarrow \sum_{j''} u_{i'+1,j''}$
13: end if
14: $S = \max\big(S, \max_k I(Y; X^k_{i',j'})\big)$

3 Face Localization

To demonstrate that this framework can be used to significantly reduce the
number of classifier evaluations required when searching for a face in an image,
we begin by evaluating the AT algorithm on a pure localization task (as done
in [15]). In the following set of experiments, each image contains exactly one
face. We describe in Section 3.1 the queries used to localize faces. In Section
3.3 we show how AT performs in terms of time, number of classifier evaluations
and accuracy.

We perform the following experiments on the Caltech Frontal Face dataset 
which consists of 450 images (896 x 592 pixels), each containing exactly 
one of 27 different faces in variously cluttered environments and illuminations. 
Face sizes range from approximately 100 to 300 pixels in width. We choose 
M = 4 possible face size intervals ([100, 150], [150, 200], [200, 250], [250, 300]). 
All experiments are conducted on a 2.0 Gigahertz machine. 

3.1 Face Queries 

To locate faces, we first specify the following set of test families, $X = \{X^1, \dots, X^K\}$,
and their associated distributions $(f^k_s, f^k_0)$. In the following experiments, $K = 30$.

The first family of tests, $X^1$, calculates the proportion of edge pixels (defined
and computed as in [1] by means of an edge oriented integral image) in a window
associated with the pose $A_{i,j}$. That is, $X^1_{1,1}$ is the proportion of pixels which are
edges within $A_{1,1}$ and similarly for all $A_{i,j}$. Test families $X^2$ to $X^5$ are similar
to $X^1$, in that they compute the proportion of edge pixels in a window centered
on $A_{i,j}$, but of larger size, by a factor $F \in \{2, 3, 4, 5\}$ (see Figure 1(c)). Note that
this factor is different from the scale $S$. Using these pose-indexed tests provides
a way to test arbitrarily large regions, even when $A_{i,j}$ is a small subwindow.
These tests also allow for overlapping $A_{i,j}$ regions and more precise estimation
of the face scale.

Families $X^6$ to $X^9$ are similar to $X^1$ but compute the proportion of edge
pixels in a particular direction (four possible directions). Similarly to families
$X^2$ to $X^5$, families $X^{10}$ to $X^{25}$ allow for a scale factor for tests in a particular
direction (4 directions x 4 factors). Using integral images allows for computation
of these tests with only 4 additions, making them very efficient.

Figure 3: (a) ROC curve of both SW+BC and AT+BC to find a face in the Caltech
Frontal Face dataset. The performance of both methods is approximately identical.
(b) Average computation time with varying pose space size. Note that image size is in
logarithmic scale. The AT algorithm performs in almost logarithmic time compared
to SW. (c) Average number of classifier evaluations when the pose space increases.
Additionally, a zoom of the AT performance is provided.

We choose to model all the $f^k_s$ for $s \in \{0, \dots, M\}$ using Beta distributions.
The Beta family makes it possible to model a wide range of smooth distributions over the
interval [0, 1] with only two parameters. The parameters of each distribution
are determined offline from a small training dataset where the face location and
scale are known (more details are given in Section 3.2).

Finally, the remaining families of $X$ (up to $X^{30}$) are the perfect tests and involve testing for
a face using a Boosted Cascade (BC). Each family specifies testing for a face at
all scales within a given interval ($s \in \{1, \dots, M\}$). For each interval, we test for
face sizes in increments of 10% of the smallest face size (total of 13 face sizes
in the range [100, 300]). In terms of operations, evaluating this test requires on
average 56 additions, 1 multiplication and 1 comparison, per face size, making
it significantly more costly than other queries. Since the BC is only informative
when the pose is very specific, we restrict this test to leaves in $A$. These BCs
are trained and provided by OpenCV [12], but modified to restrict testing to
specific regions and face sizes. Even though better classifiers have recently been
developed, we choose this one as it is publicly available and widely used.



3.2 Offline Training 

We choose to model each $f^k_s(\cdot, i)$ with a Beta distribution with parameters
$(\alpha, \beta)$. To do this, we randomly selected 50 images from the Caltech Frontal
Face Dataset [26]. Note that far fewer images are used for training here when
compared to other search methods (see [HI [25]), which typically use on the order
of 10'^ images to train their systems. The estimation of the $f^k_s(\cdot, i)$ parameters
is broken into two parts.

We first estimate all the background densities. That is, for each $k$ and $i$,
we randomly select 100 $j$'s per image, such that the face center is not in $A_{i,j}$.
We then compute the tests $X^k_{i,j} = x$, and use these to compute the parameters
using maximum likelihood estimation with 5000 data points.

To estimate the foreground densities, a similar procedure is used. We
describe the case $s = 1$. For each $k$ and $i$, we randomly select 100 $j$'s in each image
such that the face center is in $A_{i,j}$. The parameters of $f^k_1(\cdot, i)$ are then estimated
from the tests $X^k_{i,j} = x$. As before, 5000 data points are used to estimate $(\alpha, \beta)$.
In order to estimate $f^k_s(\cdot, i)$ for $1 < s \leq 4$, we subsample the images and repeat
the same procedure (similar to [6]). Additionally, the $\int (f^k_s - f^k_m)^2$ term from
equation (10) is then calculated by using a Monte Carlo approximation, and
stored in a look-up table.
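
One way to approximate the $\int (f_s - f_m)^2$ term for two Beta densities is sketched below. The sampling scheme (uniform draws on [0, 1]), the sample size and the function names are assumptions, since the text does not specify how the Monte Carlo approximation was carried out.

```python
import math
import random


def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def squared_distance_mc(params_s, params_m, n=100_000, seed=0):
    """Monte Carlo estimate of the integral of (f_s - f_m)^2 over [0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random()
        if x == 0.0:      # avoid log(0); random() never returns 1.0
            continue
        diff = beta_pdf(x, *params_s) - beta_pdf(x, *params_m)
        total += diff * diff
    return total / n
```

As a sanity check, for Beta(1, 1) (density 1) and Beta(2, 1) (density $2x$) the exact value is $\int_0^1 (1 - 2x)^2 dx = 1/3$, which the estimate approaches as `n` grows. A look-up table would then store one such value per pair $(s, m)$, query family and tree level.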



3.3 Single Face Localization 

We set up the AT algorithm with BCs (AT+BC) to run until a face is found or
until $5 \times 10^5$ classifier evaluations have been performed (see Figure 4 for details
on how this was chosen). We compare this with a sliding window approach using
the identical BCs (SW+BC), letting it run until a face is found or until all
poses have been observed. Note that both (AT+BC) and (SW+BC) have the
same pose space: all pixels and face sizes (e.g. pose space size = 896 x 592 x 13 =
6,895,616). In order to avoid any unfair bias as to where faces may be located,
we randomly pick initial starting locations in the image for (SW+BC), looping
around the image in order to observe all the poses. We report that (AT+BC)
allows for exponential computational gains over the sliding window approach
while keeping similar performance levels.

Figure 2 shows a typical behavior of the AT algorithm on a given image.
In general, the order in which queries are posed is complex and in some cases
counter-intuitive, validating the need for an online search strategy.

In Figure 3(a) we compare the accuracy of (AT+BC) and (SW+BC) on the
remaining unused 400 images of the dataset using a ROC curve. We observe that
generally (AT+BC) does not suffer much from a loss in performance compared
to the brute force sliding window approach. Note that the difference between
the two methods is not significant.

To compare how much time (AT+BC) and (SW+BC) take to locate a face
depending on the size of the pose space, we randomly selected a subset of 50
images from the testing set and subsampled these to have images of sizes (112 x 74,
224 x 148, 448 x 296, 672 x 444, 896 x 592). Figure 3(b) shows the average
time of both methods for each image size. Note that the overhead of (AT+BC),
namely the time to evaluate all queries tested, the update mechanism and the query
selection, is included in this plot (the additional time to compute an integral
image for oriented edges is not included as it is negligible). As expected, we
see that (SW+BC) is linear in the number of poses. However, the total time
(AT+BC) takes to complete is significantly lower than (SW+BC) and even more
so at large image sizes. In fact, (AT+BC) remains almost logarithmic even as
the number of poses increases. This suggests that AT uses a form of "Divide
and Conquer" search strategy. Note that at image sizes smaller than (112 x 74),
(AT+BC) is slower than (SW+BC) due to the overhead.
Figure 3(c) shows the average number of classifier evaluations both (AT+BC)
and (SW+BC) perform, when changing the image size. Notice that the
difference between (AT+BC) and (SW+BC) is even larger than the difference
reported in Figure 3(b), and that the AT algorithm significantly reduces the
number of classifier evaluations. For the largest image size AT requires 100
times fewer evaluations than SW.

Figure 4: (a) The proportion of faces detected increases with the number of classifier
evaluations: 90% of faces are correctly detected with only $10^4$ evaluations and with
$10^5$ classifier evaluations, the AT algorithm performs as well as SW, but much faster.
(b) Histogram of the number of classifier evaluations. The dotted black line represents
the probability mass function of the Geometric distribution with parameter $p = 1/9248$. (c)
Face image and associated computation image. This gray-scale image indicates the
number of times each pixel has been included in a queried window.

In Figure 4(a) we show how the accuracy of (AT+BC) is affected by the
total number of classifier evaluations allowed. The dotted line indicates the
performance of (SW+BC) when the entire pose space is observed. We see
that after observing the entire pose space ($O(10^7)$ evaluations), 98% accuracy is
achieved. Performance results are shown when (AT+BC) is stopped when either
a face has been located or after ($10^3$, $10^4$, $10^5$, $10^6$) classifier evaluations have been
performed. After only $10^4$ classifier evaluations nearly 90% of detectable faces
are found. By $10^5$ evaluations AT performs at the same accuracy level as SW.
In general, we can see in Figure 4(b) that the number of evaluations required is
approximately Geometric($p = 10^{-4}$). Hence, on average 0.0014 of the total
pose space is evaluated by the classifier.
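
The quoted fraction can be reproduced with a few lines of arithmetic, assuming the mean of a Geometric($p$) variable on $\{1, 2, \dots\}$, which is $1/p$, is used as the average number of classifier evaluations. The variable names are illustrative.

```python
pose_space = 896 * 592 * 13            # pixels x face sizes, from Section 3.3

p_fitted = 1 / 9248                    # parameter quoted in the Figure 4 caption
p_rounded = 1e-4                       # order of magnitude quoted in the text

# Mean number of evaluations divided by the size of the pose space.
fraction_fitted = (1 / p_fitted) / pose_space
fraction_rounded = (1 / p_rounded) / pose_space
```

The rounded parameter gives a fraction of about 0.00145 and the fitted one about 0.00134, so both agree with the figure of roughly 0.0014 of the pose space.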

As in [T], Figure 4(c) shows a randomly selected test image and the
corresponding computational image (right). The computational image is
a gray-scale image, which indicates the number of times each pixel has been
included in a queried window (all types of queries included). Darker regions
show areas where little computation has taken place, while white regions show
heavy computation. As expected, we can see that regions of the image
which contain few features (left part of the image) are not considered for much
computation.

4 Face Detection and Localization 

We now test the AT algorithm in a much harder setting: a detection and
localization task. We do this by looking for faces in the MIT+CMU dataset.
This dataset contains 130 images, of various sizes, where some images
contain no faces, and others contain an unknown number of faces. Face sizes
range from 20 pixels to the width of the image. As in the previous experiment,
we initialize the AT algorithm similarly to that in Sections 2 and 3.

To find multiple face instances, we assume that at any point in time, the
remaining number of faces to be found in an image follows a Poisson distribution
with parameter $\lambda Q$, where $Q$ is the number of pixels unobserved in the image,
and $\lambda$ is a face rate. We have chosen $\lambda = 10^{-4}$, corresponding to one face per
100 x 100 pixel image on average, and hence $\pi_0(\bar{A}_{1,1}) = e^{-\lambda Q}$. We then run the
AT algorithm until $\pi_t(A_{1,1}) < \epsilon$ = 10~^. When a face is found, edges from the
detected face region are removed from the integral images and the remaining
poses are assigned uniform probability. The algorithm is then restarted with
the updated prior $\pi_0$.
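
The prior implied by this Poisson assumption can be sketched as below, taking the probability that no face remains to be the Poisson probability of a zero count. The function names and the default rate argument are illustrative.

```python
import math


def prior_no_face(unobserved_pixels, rate=1e-4):
    """P(no face remains) = exp(-rate * Q) under a Poisson(rate * Q) count."""
    return math.exp(-rate * unobserved_pixels)


def prior_face(unobserved_pixels, rate=1e-4):
    """Complementary prior mass that at least one face remains."""
    return 1.0 - prior_no_face(unobserved_pixels, rate)
```

For a 100 x 100 image, $\lambda Q = 1$ and the prior probability of no face is $e^{-1} \approx 0.37$. As pixels are observed and removed, $Q$ shrinks and the prior mass on "no face remains" grows toward 1, which is what eventually stops the restarted search.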

Figure 5(a) shows the ROC curve of both the (AT+BC) and (SW+BC)
methods on the MIT+CMU dataset. In both cases no post-processing step was
applied to these results (i.e. no non-maximum suppression). First we note
that the MIT+CMU test set is much harder than the Caltech Frontal Face set.
In general, the performance of the AT algorithm is comparable to the brute
force approach. There is, however, a slight performance decrease in (AT+BC)
when compared to the exhaustive search. That is, we notice that even though
the classifier used (BC) is not very good (when compared to state-of-the-art
classifiers), little accuracy loss is observed when used in the AT framework.

From this experiment, (AT+BC) required $O(10^6)$ classifier evaluations over
the entire test set, while (SW+BC) required $O(10^7)$ evaluations. Figure 5(b)
shows the number of classifier evaluations required by both (AT+BC) and
(SW+BC) on each image. Generally, we see that AT is still able to
significantly reduce the total number of evaluations required even though the number
of faces in the images is a priori unknown. Figure 5(c) shows a similar result in
terms of time. Again, computational gains are of one order of magnitude over
the entire test set.

Notice in Figures 5(b) and 5(c) that for images of the same pose space size, the
number of classifier evaluations and time necessary for (AT+BC) to terminate
vary. This variance is due to the fact that (AT+BC) stops when the estimate of
having a face in the image is very low: $\pi_t(A_{1,1}) < \epsilon$ = 10^^. Hence, in images
which contain many face-like features, the algorithm will need to visit many
more locations to see if faces are still present. This is precisely what is observed
in Figures 5(b) and 5(c).

5 Conclusion 

We have proposed an Active Testing framework in which one can perform fast 
face detection and localization in images. In order to find faces, we use a coarse- 
to-fine method, while sampling subwindows which maximize information gain. 
This allows us to quickly find the face pose by focusing on regions of interest, 
and pruning large image regions. We show through a series of experiments that
the active testing framework can be used to significantly reduce the number
of classifier evaluations when searching for an object. Exponential speedup is
observed when detecting and locating faces compared to the traditional sliding
window approach (particularly on large image sizes), without significant loss in
performance levels, indicating that this method is scalable to larger image sizes.

Figure 5: (a) ROC for both the sliding window and the Active Testing approaches on
the MIT+CMU frontal face dataset. The AT algorithm achieves similar performance
levels to the exhaustive search. (b) Number of classifier evaluations for each image
in the test set. Clearly the AT approach does not suffer as much from the increase in
pose space. (c) Time performance for each image in the test set.